Search CORE

17 research outputs found

TweetLID : a benchmark for tweet language identification

Author: A Xafopoulos
Aitzol Ezeiza
Arkaitz Zubiaga
C Myers-Scotton
E Baykan
F Jelinek
F Sebastiani
Iñaki Alegria
Iñaki San Vicente
JC Paolillo
José Ramom Pichel
KN Murthy
L Derczynski
M Cárdenas-Claros
M Lui
M Padró
Nora Aranberri
P McNamee
Pablo Gamallo
RD Brown
S Carter
Víctor Fresno
WB Cavnar
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Language identification, the task of determining the language a given text is written in, has progressed substantially in recent decades. However, three main issues remain unresolved: (1) distinguishing similar languages, (2) detecting multilingualism in a single document, and (3) identifying the language of short texts. In this paper, we describe our work on the development of a benchmark to encourage further research in these three directions, set forth an evaluation framework suitable for the task, and make a dataset of annotated tweets publicly available for research purposes. We also describe the shared task we organized to validate and assess the evaluation framework and dataset with systems submitted by seven different participants, and analyze the performance of these systems. The evaluation of the results submitted by the shared task participants sheds light on the shortcomings of state-of-the-art language identification systems and gives insight into the extent to which the brevity, multilingualism, and language similarity found in texts degrade the performance of language identifiers. Our dataset of nearly 35,000 tweets and the evaluation framework provide researchers and practitioners with suitable resources to further study these issues in language identification within a common setting that enables results to be compared with one another.

Crossref

Warwick Research Archives Portal Repository

Queen Mary Research Online

N-gram analysis of 970 microbial organisms reveals presence of biological language models

Author: A Campbell
A Poddar
A Tomovic
AL Demain
BR King
BY Cheng
C Woese
CD Manning
D Tauritz
DJ McFarlane
DT Pride
DW Hosmer
E Buehler
F Daeyaert
GM Pavlovic-Lazetic
Hatice Ulku Osmanbeyoglu
J Qi
JC Schmitt
JO McInerney
K Fukami-Kobayashi
K Lee
L Bahl
LG Rahme
M Ganapathiraju
Madhavi K Ganapathiraju
MW van Passel
NS Mitic
P Engel
P Meng
R Hershberg
S Karlin
S Yang
TD Heer
TS Rani
V Kešelj
VV Solovyev
WB Cavnar
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis has been applied for comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occurring very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, thereby limiting the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.
Results: We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity employing the Biological Language Modeling Toolkit and Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking statistical n-gram model of one organism as reference and computing cross-perplexity of all other microbial proteomes with it, cross-perplexity was found to be predictive of branch distance of the phylogenetic tree. For example, a 4-gram model from proteome of Shigellae flexneri 2a, which belongs to the Gammaproteobacteria class showed a self-perplexity of 15.34 while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values with E. coli.
Conclusion: Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences in abundance in one organism but occurring very rarely in other organisms, thereby serving as proteome signatures. Further it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
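
The cross-perplexity idea in the abstract above can be illustrated with a minimal sketch. This is not the Biological Language Modeling Toolkit or Patternix Revelio; it is a toy character n-gram model, and the add-one smoothing, the function names (`train_ngram`, `perplexity`), and the short example sequences are all assumptions made for illustration only.

```python
import math
from collections import Counter

# The 20 standard amino acid one-letter codes (used only for smoothing).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_ngram(seq, n):
    """Count n-grams and their (n-1)-gram contexts in a reference sequence."""
    positions = range(len(seq) - n + 1)
    ngrams = Counter(seq[i:i + n] for i in positions)
    contexts = Counter(seq[i:i + n - 1] for i in positions)
    return ngrams, contexts


def perplexity(model, seq, n):
    """Perplexity of `seq` under `model`, with add-one smoothing (an assumption)."""
    ngrams, contexts = model
    vocab = len(AMINO_ACIDS)
    log_prob = 0.0
    count = 0
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        # Smoothed conditional probability of the last symbol given its context.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        log_prob += math.log(p)
        count += 1
    return math.exp(-log_prob / count)


# Hypothetical toy "proteomes"; real inputs would be whole proteome sequences.
reference = "MKVLAAGIV" * 10
other = "WWPPHHCC" * 10

model = train_ngram(reference, 2)
self_pp = perplexity(model, reference, 2)   # perplexity on the reference itself
cross_pp = perplexity(model, other, 2)      # cross-perplexity of another sequence
print(self_pp < cross_pp)
```

Under these assumptions, a sequence whose n-gram composition differs from the reference scores a higher perplexity than the reference does against its own model, which is the kind of comparison the abstract reports across organisms.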

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

D-Scholarship@Pitt

A Natural Language Query Interface for Tourism Information

Author: C Silverstein
J Nielsen
M Dittenbach
WB Cavnar
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2003

Crossref

Multivariate patent analysis—Using chemometrics to analyze collections of chemical and pharmaceutical patents

Author: Bird S
Cavnar WB
Joachims T
McCallum A
Pedregosa F
Publication venue: Wiley

Crossref

Exploratory study on risk management in open innovation

Author: DG Rajpathak
FP Appio
G Castellion
J Rosas
N Hewitt-Dundas
WB Cavnar
Publication venue: Springer Science and Business Media LLC
Publication date: 01/09/2017

Open innovation is a strategy increasingly adopted by value-seeking companies that use or share technology with the outside world, but it also carries risk. Risk management, however, seems to have been overlooked by researchers on open innovation networks. This exploratory work clarifies to what extent the issue of risk has been considered in open innovation research. The results presented are based on interviews and an analysis of the existing literature on open innovation.

Repositório Científico do Instituto Politécnico de Lisboa

Crossref

GPU Based N-Gram String Matching Algorithm with Score Table Approach for String Searching in Many Documents

Author: CS Kouzinopoulos
DE Knuth
E Ukkonen
I Moraru
J Sharma
L Mussi
M Góngora-Blandón
RM Karp
RS Boyer
S Tomov
WB Cavnar
Publication venue: Springer Science and Business Media LLC

Crossref